Introduction

This report explores the relationship between weather patterns and crime rates in Colchester during 2024. The central research question was: which season was the safest in Colchester? This leads to broader inquiries into whether crime patterns changed in response to fluctuations in weather conditions, and whether particular types of crime were more likely to occur in particular seasons.

To investigate these questions, two data sets were used. The first, crime24.csv, contains crime data for Colchester throughout 2024, detailing the nature, location, and timing of reported incidents. The second, temp24.csv, contains daily meteorological data from a single weather station, capturing variables such as temperature, rainfall, humidity, and sunshine hours, among others.

Before any analysis of the data, data pre-processing steps were carried out to ensure both data sets were clean, consistent, and properly formatted. This involved handling missing values, parsing dates, generating season-based groupings, and aggregating observations for meaningful comparisons. Once prepared, the data sets were merged using a common variable (year_month), enabling an integrated view of weather and crime in Colchester.

The analysis incorporated a variety of visual and statistical methods to explore potential relationships between weather conditions and crime. These included summary tables, bar and density plots, violin plots, scatterplots, correlation matrices, time series visualisations with smoothing, and an interactive spatial map. These tools provided an evidence-based foundation for interpreting the temporal and spatial dynamics of crime in Colchester.

Data Preparation

The data preparation stage began with the installation and loading of several R packages required for the analysis. These libraries included tidyverse, lubridate, ggplot2, plotly, and leaflet, among others, each providing essential functions for data manipulation, visualisation, and interactivity.

Following this, the two data sets crime24.csv and temp24.csv were imported into the R environment and previewed. This initial exploration provided insight into the structure and contents of each data set and guided the necessary data pre-processing steps.

It was observed that both data sets contained a number of irrelevant or redundant columns, which were subsequently removed to streamline the analysis. Additionally, missing values were present in several variables. In the weather data, missing precipitation and low-level cloud values were imputed using median substitution; in the crime data, missing outcome_status values were recoded as “Unknown”. Columns that were entirely or almost entirely missing were dropped rather than imputed, so no rows had to be excluded.

Further cleaning included the parsing and formatting of date variables. The crime data set featured dates in a “YYYY-MM” format, while the weather data set used full “YYYY-MM-DD” dates. To enable merging and seasonal aggregation, both date formats were standardised, and a new variable (year_month) was generated in each data set to represent monthly time intervals. A season variable was also created using the month of each observation to classify entries into Winter, Spring, Summer, or Autumn.

Finally, the cleaned data sets were merged using the year_month variable, resulting in a combined data set that allowed for comparative analysis between monthly crime levels and prevailing weather conditions. This merged data set formed the foundation for some subsequent visualisations and statistical interpretations.
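
As a rough illustration of the merge described above, the sketch below aggregates toy crime and weather records to one row per month and joins them on year_month. The toy values and the object names (crime_toy, temp_toy, monthly_crime, monthly_weather, merged) are assumptions for illustration, not the report's actual objects.

```r
# Toy stand-ins for the cleaned data sets (values are made up for illustration)
crime_toy <- data.frame(
  year_month = c("2024-01", "2024-01", "2024-02"),
  category   = c("burglary", "drugs", "burglary")
)
temp_toy <- data.frame(
  year_month      = c("2024-01", "2024-01", "2024-02", "2024-02"),
  TemperatureCAvg = c(4, 6, 7, 9)
)

# One row per month: number of recorded crimes
monthly_crime <- as.data.frame(
  table(year_month = crime_toy$year_month),
  responseName = "crime_count",
  stringsAsFactors = FALSE
)

# One row per month: mean of the daily average temperature
monthly_weather <- aggregate(TemperatureCAvg ~ year_month, data = temp_toy, FUN = mean)

# Join the two monthly summaries on the shared key
merged <- merge(monthly_crime, monthly_weather, by = "year_month")
print(merged)
```

Aggregating both sides to monthly level before merging keeps one row per month; merging the raw daily weather rows directly onto the crime rows would repeat each crime once per day of its month.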

Reading and previewing the data:

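#load the packages used by the code shown in this report
library(tidyverse) # dplyr, tidyr and ggplot2
library(plotly)    # ggplotly()

#load data sets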
crime <- read.csv("crime24.csv")
temp <- read.csv("temp24.csv")

# Preview data sets
head(crime)
##   X              category persistent_id    date      lat     long street_id
## 1 1 anti-social-behaviour               2024-01 51.89301 0.901028   2153130
## 2 2 anti-social-behaviour               2024-01 51.88979 0.898830   2153105
## 3 3 anti-social-behaviour               2024-01 51.89825 0.902107   2153147
## 4 4 anti-social-behaviour               2024-01 51.87837 0.888373   2152856
## 5 5 anti-social-behaviour               2024-01 51.87905 0.889521   2152871
## 6 6 anti-social-behaviour               2024-01 51.88860 0.899203   2153107
##                               street_name context        id location_type
## 1                  On or near Middle Mill      NA 115967607         Force
## 2 On or near Conference/exhibition Centre      NA 115967129         Force
## 3                   On or near Mason Road      NA 115967591         Force
## 4              On or near Kensington Road      NA 115967062         Force
## 5                 On or near Lambeth Road      NA 115967058         Force
## 6               On or near Trinity Street      NA 115967547         Force
##   location_subtype outcome_status
## 1                            <NA>
## 2                            <NA>
## 3                            <NA>
## 4                            <NA>
## 5                            <NA>
## 6                            <NA>
head(temp)
##   station_ID       Date TemperatureCAvg TemperatureCMax TemperatureCMin TdAvgC
## 1       3590 2024-12-31             6.5             7.7             5.0    4.4
## 2       3590 2024-12-30             5.6             6.9             3.4    4.9
## 3       3590 2024-12-29             3.3             4.9             2.2    3.2
## 4       3590 2024-12-28             4.0             5.8             2.3    3.7
## 5       3590 2024-12-27             5.3             6.7             4.3    5.1
## 6       3590 2024-12-26             6.7            10.0             5.6    6.4
##   HrAvg WindkmhDir WindkmhInt WindkmhGust PresslevHp Precmm TotClOct lowClOct
## 1  86.4        WSW       22.7        42.6     1025.3    0.0      4.5      7.2
## 2  94.9        WSW       16.7        40.8     1028.5    0.0      8.0      8.0
## 3  98.6          W       11.4        22.2     1028.5    0.4      8.0      8.0
## 4  98.4         SW        5.5        14.8     1031.8    0.4      8.0      8.0
## 5  98.4          S        6.3        16.7     1034.7    0.4      8.0      8.0
## 6  98.3        WSW        9.3        22.2     1033.6    0.4      8.0      8.0
##   SunD1h VisKm SnowDepcm PreselevHp
## 1    5.7  63.4        NA         NA
## 2    0.0  15.3        NA         NA
## 3    0.0   0.5        NA         NA
## 4    0.0   0.1        NA         NA
## 5    0.0   0.5        NA         NA
## 6    0.0   0.2        NA         NA
dim(crime)
## [1] 6304   13
dim(temp)
## [1] 366  18

The preview of data revealed there are 6304 rows and 13 columns in the crime data, while there are 366 rows and 18 columns in the temp data.

The variables in the crime data set are:

X: row indexes without variable title

category: Category of the crime (https://data.police.uk/docs/method/crime-street/)

persistent_id: 64-character unique identifier for that crime. (This is different to the existing ‘id’ attribute, which is not guaranteed to always stay the same for each crime.)

date: Date of the crime in format: YYYY-MM

lat: Latitude coordinate

long: Longitude coordinate

street_id: Unique identifier for the street

street_name: Name of the location. An approximation of where the crime happened

context: Extra information about the crime (if applicable)

id: ID of the crime. This ID only relates to the API, it is NOT a police identifier

location_type: The type of the location. Either Force or BTP: Force indicates a normal police force location; BTP indicates a British Transport Police location. BTP locations fall within normal police force boundaries.

location_subtype: For BTP locations, the type of location at which this crime was recorded.

outcome_status: The category and date of the latest recorded outcome for the crime

The variables in the temp data set are:

station_ID - WMO station identifier

Date - date (and time) of observations. Format: YYYY-MM-DD

VisKm - visibility in kilometres

TemperatureCAvg - average air temperature at 2 metres above ground level. Values given in Celsius degrees

TemperatureCMax - maximum air temperature at 2 metres above ground level. Values given in Celsius degrees

TemperatureCMin - minimum air temperature at 2 metres above ground level. Values given in Celsius degrees

TdAvgC - average dew point temperature at 2 metres above ground level. Values given in Celsius degrees

HrAvg - average relative humidity. Values given in %

WindkmhDir - wind direction

WindkmhInt - wind speed in km/h

WindkmhGust - wind gust in km/h

PresslevHp - Sea level pressure in hPa

Precmm - precipitation totals in mm

TotClOct - total cloudiness in octants

lowClOct - cloudiness by low level clouds in octants

SunD1h - sunshine duration in hours

PreselevHp - atmospheric pressure measured at altitude of station in hPa

SnowDepcm - depth of snow cover in centimetres

#output data type and structure of data sets 
str(crime)
## 'data.frame':    6304 obs. of  13 variables:
##  $ X               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ category        : chr  "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
##  $ persistent_id   : chr  "" "" "" "" ...
##  $ date            : chr  "2024-01" "2024-01" "2024-01" "2024-01" ...
##  $ lat             : num  51.9 51.9 51.9 51.9 51.9 ...
##  $ long            : num  0.901 0.899 0.902 0.888 0.89 ...
##  $ street_id       : int  2153130 2153105 2153147 2152856 2152871 2153107 2152963 2152963 2153186 2153163 ...
##  $ street_name     : chr  "On or near Middle Mill" "On or near Conference/exhibition Centre" "On or near Mason Road" "On or near Kensington Road" ...
##  $ context         : logi  NA NA NA NA NA NA ...
##  $ id              : int  115967607 115967129 115967591 115967062 115967058 115967547 115967516 115967638 115967128 115967378 ...
##  $ location_type   : chr  "Force" "Force" "Force" "Force" ...
##  $ location_subtype: chr  "" "" "" "" ...
##  $ outcome_status  : chr  NA NA NA NA ...
str(temp)
## 'data.frame':    366 obs. of  18 variables:
##  $ station_ID     : int  3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
##  $ Date           : chr  "2024-12-31" "2024-12-30" "2024-12-29" "2024-12-28" ...
##  $ TemperatureCAvg: num  6.5 5.6 3.3 4 5.3 6.7 9.4 4.3 4.6 7.2 ...
##  $ TemperatureCMax: num  7.7 6.9 4.9 5.8 6.7 10 12.3 6.9 7.9 11 ...
##  $ TemperatureCMin: num  5 3.4 2.2 2.3 4.3 5.6 3.5 2.5 2.5 3.3 ...
##  $ TdAvgC         : num  4.4 4.9 3.2 3.7 5.1 6.4 8.8 1.8 -0.5 4.5 ...
##  $ HrAvg          : num  86.4 94.9 98.6 98.4 98.4 98.3 95.6 84.2 70 83 ...
##  $ WindkmhDir     : chr  "WSW" "WSW" "W" "SW" ...
##  $ WindkmhInt     : num  22.7 16.7 11.4 5.5 6.3 9.3 15.4 16.4 36.8 28 ...
##  $ WindkmhGust    : num  42.6 40.8 22.2 14.8 16.7 22.2 31.5 50 70.4 66.7 ...
##  $ PresslevHp     : num  1025 1028 1028 1032 1035 ...
##  $ Precmm         : num  0 0 0.4 0.4 0.4 0.4 0 0 0.8 0.8 ...
##  $ TotClOct       : num  4.5 8 8 8 8 8 6.8 6.7 4.3 6.6 ...
##  $ lowClOct       : num  7.2 8 8 8 8 8 6.8 7.6 5.2 6.9 ...
##  $ SunD1h         : num  5.7 0 0 0 0 0 0 1.4 2.8 0 ...
##  $ VisKm          : num  63.4 15.3 0.5 0.1 0.5 0.2 13.3 20 38.8 34.9 ...
##  $ SnowDepcm      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ PreselevHp     : logi  NA NA NA NA NA NA ...

To streamline the data sets and retain only relevant variables for analysis, several columns were removed. From the crime data, the columns X, context, location_subtype, and persistent_id were deleted. The X column contained row indices, which were not necessary for analysis. The context column consisted entirely of missing values (6304 NAs), offering no usable information. The location_subtype column appeared largely empty, with blank strings displayed as empty quotation marks in the structure output, and it applies only to BTP locations, so it was of little use here. Similarly, the persistent_id column held either blank strings or 64-character identifiers, which carry no analytical meaning.

From the weather data set, the columns PreselevHp and SnowDepcm were excluded. PreselevHp was missing for all 366 days and SnowDepcm for 364 of them, so neither could contribute meaningfully to the analysis, especially as snowfall levels in Colchester are often minimal or inconsistent.

Following this structural clean-up, the data sets underwent further refinement. Missing values in the numeric weather fields Precmm (24 missing) and lowClOct (5 missing) were imputed using the median of the corresponding column, and missing outcome_status values in the crime data (710 missing) were recoded as “Unknown”. This approach preserved the data sets’ size, and the median limits the influence of outliers such as days of very heavy rainfall.
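
The cleaning code in this report fills gaps with the column median. As a minimal sketch on made-up rainfall values, the example below shows how far the mean and median can differ when a column is right-skewed; impute_median is a hypothetical helper name, not a function used in the report.

```r
# Made-up daily rainfall (mm) with two gaps and one very wet day
rain <- c(0, 0, 0.2, NA, 0.4, 38, NA, 0)

# Hypothetical helper: replace NAs with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

mean(rain, na.rm = TRUE)    # pulled upwards by the single 38 mm day
median(rain, na.rm = TRUE)  # stays close to a typical day

rain_filled <- impute_median(rain)
print(rain_filled)
```

For a variable like daily precipitation, where most days are dry or nearly dry, filling gaps with the mean would assign every missing day an untypically wet value.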

The date formats in both data sets were also standardised. The crime data’s date variable, originally formatted as “YYYY-MM”, was parsed to generate a full date representation (“YYYY-MM-01”) for consistency. The weather data’s Date variable, already in “YYYY-MM-DD” format, was converted to a proper Date class in R. From these, a new variable, year_month, was derived in both data sets to enable monthly aggregation. Additionally, a season variable was added by mapping each observation’s month to one of the four meteorological seasons: Winter (December–February), Spring (March–May), Summer (June–August), and Autumn (September–November). Because the data cover a single calendar year, Winter here combines January, February, and December 2024.
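
The month-to-season mapping can be sketched in base R as a small function applied to parsed dates. The helper name season_of and the example dates are assumptions for illustration.

```r
# Hypothetical helper: map a month number (1-12) to a meteorological season
season_of <- function(month_num) {
  ifelse(month_num %in% 3:5, "Spring",
  ifelse(month_num %in% 6:8, "Summer",
  ifelse(month_num %in% 9:11, "Autumn", "Winter")))
}

# Example dates (made up), parsed to the Date class
dates <- as.Date(c("2024-01-15", "2024-04-01", "2024-07-31", "2024-10-10", "2024-12-25"))
month_num <- as.numeric(format(dates, "%m"))

result <- data.frame(
  date       = dates,
  year_month = format(dates, "%Y-%m"),
  season     = season_of(month_num)
)
print(result)
```

Defining the mapping once and applying it to both data sets is one way to guarantee that the season labels agree.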

These cleaning and formatting steps ensured both data sets were aligned, enabling accurate merging and visual exploration in the subsequent analysis stages.

#basic statistics 
summary(crime)
##        X          category         persistent_id          date          
##  Min.   :   1   Length:6304        Length:6304        Length:6304       
##  1st Qu.:1577   Class :character   Class :character   Class :character  
##  Median :3152   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :3152                                                           
##  3rd Qu.:4728                                                           
##  Max.   :6304                                                           
##       lat             long          street_id       street_name       
##  Min.   :51.88   Min.   :0.8788   Min.   :2152686   Length:6304       
##  1st Qu.:51.89   1st Qu.:0.8966   1st Qu.:2153025   Class :character  
##  Median :51.89   Median :0.9013   Median :2153155   Mode  :character  
##  Mean   :51.89   Mean   :0.9029   Mean   :2153873                     
##  3rd Qu.:51.89   3rd Qu.:0.9088   3rd Qu.:2153366                     
##  Max.   :51.90   Max.   :0.9246   Max.   :2343256                     
##  context              id            location_type      location_subtype  
##  Mode:logical   Min.   :115954844   Length:6304        Length:6304       
##  NA's:6304      1st Qu.:118009952   Class :character   Class :character  
##                 Median :120228058   Mode  :character   Mode  :character  
##                 Mean   :120403000                                        
##                 3rd Qu.:122339060                                        
##                 Max.   :125550731                                        
##  outcome_status    
##  Length:6304       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
summary(temp)
##    station_ID       Date           TemperatureCAvg TemperatureCMax
##  Min.   :3590   Length:366         Min.   :-2.60   Min.   : 1.10  
##  1st Qu.:3590   Class :character   1st Qu.: 7.00   1st Qu.:10.72  
##  Median :3590   Mode  :character   Median :10.95   Median :14.75  
##  Mean   :3590                      Mean   :10.98   Mean   :15.08  
##  3rd Qu.:3590                      3rd Qu.:14.50   3rd Qu.:19.60  
##  Max.   :3590                      Max.   :23.10   Max.   :29.80  
##                                                                   
##  TemperatureCMin      TdAvgC           HrAvg        WindkmhDir       
##  Min.   :-6.100   Min.   :-6.000   Min.   :59.60   Length:366        
##  1st Qu.: 3.325   1st Qu.: 4.725   1st Qu.:75.90   Class :character  
##  Median : 6.800   Median : 8.200   Median :82.75   Mode  :character  
##  Mean   : 6.486   Mean   : 7.752   Mean   :81.74                     
##  3rd Qu.: 9.500   3rd Qu.:11.000   3rd Qu.:88.80                     
##  Max.   :16.700   Max.   :16.900   Max.   :98.60                     
##                                                                      
##    WindkmhInt     WindkmhGust       PresslevHp         Precmm      
##  Min.   : 3.90   Min.   : 11.10   Min.   : 978.9   Min.   : 0.000  
##  1st Qu.:12.22   1st Qu.: 31.50   1st Qu.:1007.5   1st Qu.: 0.000  
##  Median :15.80   Median : 38.90   Median :1013.8   Median : 0.200  
##  Mean   :16.52   Mean   : 40.81   Mean   :1013.7   Mean   : 1.864  
##  3rd Qu.:19.80   3rd Qu.: 48.20   3rd Qu.:1021.0   3rd Qu.: 1.600  
##  Max.   :42.50   Max.   :105.60   Max.   :1037.3   Max.   :38.000  
##                                                    NA's   :24      
##     TotClOct        lowClOct         SunD1h           VisKm      
##  Min.   :0.000   Min.   :1.000   Min.   : 0.000   Min.   : 0.10  
##  1st Qu.:3.800   1st Qu.:5.800   1st Qu.: 0.325   1st Qu.:20.73  
##  Median :5.600   Median :6.900   Median : 3.500   Median :30.95  
##  Mean   :5.304   Mean   :6.609   Mean   : 4.203   Mean   :31.42  
##  3rd Qu.:7.200   3rd Qu.:7.600   3rd Qu.: 7.100   3rd Qu.:41.20  
##  Max.   :8.000   Max.   :8.000   Max.   :15.600   Max.   :71.20  
##                  NA's   :5                                       
##    SnowDepcm    PreselevHp    
##  Min.   :1.00   Mode:logical  
##  1st Qu.:1.25   NA's:366      
##  Median :1.50                 
##  Mean   :1.50                 
##  3rd Qu.:1.75                 
##  Max.   :2.00                 
##  NA's   :364

Data cleaning:

#1.Drop irrelevant columns:
#crime, 13 columns - 4 = 9 columns left  
crime<- crime%>%select(-c(X, context, location_subtype, persistent_id))

#temp, 18 columns - 2 = 16 columns left 
temp<- temp%>%select(-c(PreselevHp,SnowDepcm))

#check should output 9 and 16 
length(crime)
## [1] 9
length(temp)
## [1] 16
#2.Handling Missing values (NAs):
#checking total nas in each column of both data sets  
colSums(is.na(crime))
##       category           date            lat           long      street_id 
##              0              0              0              0              0 
##    street_name             id  location_type outcome_status 
##              0              0              0            710
colSums(is.na(temp))
##      station_ID            Date TemperatureCAvg TemperatureCMax TemperatureCMin 
##               0               0               0               0               0 
##          TdAvgC           HrAvg      WindkmhDir      WindkmhInt     WindkmhGust 
##               0               0               0               0               0 
##      PresslevHp          Precmm        TotClOct        lowClOct          SunD1h 
##               0              24               0               5               0 
##           VisKm 
##               0
#where the outcome_status==NA, impute Unknown
crime$outcome_status[is.na(crime$outcome_status)]<- "Unknown"

#where the Precmm ==NA, impute column median 
temp$Precmm[is.na(temp$Precmm)]<- median(temp$Precmm, na.rm=TRUE)

#Where the lowClOct== NA, impute column median 
temp$lowClOct[is.na(temp$lowClOct)]<- median(temp$lowClOct, na.rm=TRUE)


#re-check, should be zero 
sum(is.na(crime))
## [1] 0
sum(is.na(temp))
## [1] 0
#3.date and Date variable format, and extract month num for season assignment:

#crime, extract month num so that i can assign season:
crime <- crime %>%
  #create 4 new columns: date_parsed, year_month, month_num and crime_season
  mutate(
    # Convert to Date format first by appending a day ("-01") to YYYY-MM
    date_parsed = as.Date(paste0(date, "-01")),
    
    # The crime date is already in YYYY-MM format, so copy it as year_month
    year_month = date,
    
    #extract month num from date:
    month_num = as.numeric(format(date_parsed, "%m")),
    
    #assign seasons:
    crime_season = case_when(
      month_num %in% 3:5   ~ "Spring",
      month_num %in% 6:8   ~ "Summer",
      month_num %in% 9:11  ~ "Autumn",
      TRUE                 ~ "Winter"))


#temp, extract month num so that i can assign season:
temp <- temp %>%
  mutate(
    # Convert to Date format first
    Date_parsed = as.Date(Date),
    
    # Create YYYY-MM (to match crime data)
    year_month = format(Date_parsed, "%Y-%m"),
    
    # Extract month number for seasons
    month_num = as.numeric(format(Date_parsed, "%m")),
    
    # Assign seasons (same logic as crime data)
    temp_season = case_when(
      month_num %in% 3:5   ~ "Spring",
      month_num %in% 6:8   ~ "Summer",
      month_num %in% 9:11  ~ "Autumn",
      TRUE                 ~ "Winter"))

# Check crime data
crime %>% 
  select(year_month, month_num, crime_season) %>% 
  distinct()
##    year_month month_num crime_season
## 1     2024-01         1       Winter
## 2     2024-02         2       Winter
## 3     2024-03         3       Spring
## 4     2024-04         4       Spring
## 5     2024-05         5       Spring
## 6     2024-06         6       Summer
## 7     2024-07         7       Summer
## 8     2024-08         8       Summer
## 9     2024-09         9       Autumn
## 10    2024-10        10       Autumn
## 11    2024-11        11       Autumn
## 12    2024-12        12       Winter
# Check temp data
temp %>% 
  select(year_month,month_num, temp_season) %>%
  distinct()
##    year_month month_num temp_season
## 1     2024-12        12      Winter
## 2     2024-11        11      Autumn
## 3     2024-10        10      Autumn
## 4     2024-09         9      Autumn
## 5     2024-08         8      Summer
## 6     2024-07         7      Summer
## 7     2024-06         6      Summer
## 8     2024-05         5      Spring
## 9     2024-04         4      Spring
## 10    2024-03         3      Spring
## 11    2024-02         2      Winter
## 12    2024-01         1      Winter
#check data types and entries, making sure date parsed in both data sets 
glimpse(crime)
## Rows: 6,304
## Columns: 13
## $ category       <chr> "anti-social-behaviour", "anti-social-behaviour", "anti…
## $ date           <chr> "2024-01", "2024-01", "2024-01", "2024-01", "2024-01", …
## $ lat            <dbl> 51.89301, 51.88979, 51.89825, 51.87837, 51.87905, 51.88…
## $ long           <dbl> 0.901028, 0.898830, 0.902107, 0.888373, 0.889521, 0.899…
## $ street_id      <int> 2153130, 2153105, 2153147, 2152856, 2152871, 2153107, 2…
## $ street_name    <chr> "On or near Middle Mill", "On or near Conference/exhibi…
## $ id             <int> 115967607, 115967129, 115967591, 115967062, 115967058, …
## $ location_type  <chr> "Force", "Force", "Force", "Force", "Force", "Force", "…
## $ outcome_status <chr> "Unknown", "Unknown", "Unknown", "Unknown", "Unknown", …
## $ date_parsed    <date> 2024-01-01, 2024-01-01, 2024-01-01, 2024-01-01, 2024-0…
## $ year_month     <chr> "2024-01", "2024-01", "2024-01", "2024-01", "2024-01", …
## $ month_num      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ crime_season   <chr> "Winter", "Winter", "Winter", "Winter", "Winter", "Wint…
glimpse(temp)
## Rows: 366
## Columns: 20
## $ station_ID      <int> 3590, 3590, 3590, 3590, 3590, 3590, 3590, 3590, 3590, …
## $ Date            <chr> "2024-12-31", "2024-12-30", "2024-12-29", "2024-12-28"…
## $ TemperatureCAvg <dbl> 6.5, 5.6, 3.3, 4.0, 5.3, 6.7, 9.4, 4.3, 4.6, 7.2, 4.7,…
## $ TemperatureCMax <dbl> 7.7, 6.9, 4.9, 5.8, 6.7, 10.0, 12.3, 6.9, 7.9, 11.0, 7…
## $ TemperatureCMin <dbl> 5.0, 3.4, 2.2, 2.3, 4.3, 5.6, 3.5, 2.5, 2.5, 3.3, 0.3,…
## $ TdAvgC          <dbl> 4.4, 4.9, 3.2, 3.7, 5.1, 6.4, 8.8, 1.8, -0.5, 4.5, 3.4…
## $ HrAvg           <dbl> 86.4, 94.9, 98.6, 98.4, 98.4, 98.3, 95.6, 84.2, 70.0, …
## $ WindkmhDir      <chr> "WSW", "WSW", "W", "SW", "S", "WSW", "W", "W", "WNW", …
## $ WindkmhInt      <dbl> 22.7, 16.7, 11.4, 5.5, 6.3, 9.3, 15.4, 16.4, 36.8, 28.…
## $ WindkmhGust     <dbl> 42.6, 40.8, 22.2, 14.8, 16.7, 22.2, 31.5, 50.0, 70.4, …
## $ PresslevHp      <dbl> 1025.3, 1028.5, 1028.5, 1031.8, 1034.7, 1033.6, 1026.9…
## $ Precmm          <dbl> 0.0, 0.0, 0.4, 0.4, 0.4, 0.4, 0.0, 0.0, 0.8, 0.8, 1.0,…
## $ TotClOct        <dbl> 4.5, 8.0, 8.0, 8.0, 8.0, 8.0, 6.8, 6.7, 4.3, 6.6, 4.6,…
## $ lowClOct        <dbl> 7.2, 8.0, 8.0, 8.0, 8.0, 8.0, 6.8, 7.6, 5.2, 6.9, 5.5,…
## $ SunD1h          <dbl> 5.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.4, 2.8, 0.0, 0.3,…
## $ VisKm           <dbl> 63.4, 15.3, 0.5, 0.1, 0.5, 0.2, 13.3, 20.0, 38.8, 34.9…
## $ Date_parsed     <date> 2024-12-31, 2024-12-30, 2024-12-29, 2024-12-28, 2024-…
## $ year_month      <chr> "2024-12", "2024-12", "2024-12", "2024-12", "2024-12",…
## $ month_num       <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12…
## $ temp_season     <chr> "Winter", "Winter", "Winter", "Winter", "Winter", "Win…

Note: temp_season and crime_season are interchangeable. Both are derived from month_num, both data sets cover all 12 months of 2024, and the same month-to-season mapping was applied to each.

Exploratory Data Analysis

#one way TABLE: crime counts per category in descending order 
crime_table <- crime%>%
  count(category, sort=TRUE)

knitr::kable(crime_table, caption = "Crime category counts in Colchester (2024)")
Crime category counts in Colchester (2024)
category                  n
violent-crime          2420
anti-social-behaviour   710
shoplifting             629
criminal-damage-arson   479
public-order            458
other-theft             412
vehicle-crime           270
drugs                   265
burglary                171
bicycle-theft           149
other-crime             100
theft-from-the-person    91
robbery                  85
possession-of-weapons    65
#one way TABLE: weather averages by season 
temp_table <- temp %>%
  group_by(temp_season) %>%
  summarise(
    "avg_temp (°C)" = round(mean(TemperatureCAvg),2), #seasonal mean of daily average temperature
    "max_temp (°C)" = round(mean(TemperatureCMax),2), #seasonal mean of daily maximum temperature
    "min_temp (°C)" = round(mean(TemperatureCMin),2), #seasonal mean of daily minimum temperature
    "total_rainfall (mm)" = sum(Precmm), #total rainfall per season
    "avg_sunshine (hours)" = round(mean(SunD1h),2), #avg daily sunshine in hours
    .groups = "drop") #return an ungrouped result
 
knitr::kable(temp_table, caption = "Weather Averages by season (2024)")
Weather Averages by season (2024)
temp_season  avg_temp (°C)  max_temp (°C)  min_temp (°C)  total_rainfall (mm)  avg_sunshine (hours)
Autumn               11.22          15.02           7.18                138.6                  3.32
Spring               10.22          14.38           5.86                194.2                  4.65
Summer               16.33          21.67          10.09                122.0                  6.95
Winter                6.12           9.18           2.78                187.4                  1.86
#TWO WAY TABLE OF crime category vs temp_season:
#answers, question: in what season does x crime occur?
#which crimes occur in each season?
#which season had the most crime?
two_way_table <- crime %>%
  count(category, crime_season, name = "n") %>%
  #pivot to wider format: seasons become columns, one row of counts per category
  pivot_wider(names_from = crime_season, values_from = n) %>%
  #add row total for category
  mutate(Total = rowSums(across(where(is.numeric)))) %>%
  #add a totals row (sum of crimes per season)
  bind_rows(
    summarise(.,
              category = "Seasons_Totals",
              across(where(is.numeric), sum))) %>% #sum all numeric columns
  arrange(desc(Total)) #sort by count, so the totals row comes first

knitr::kable(two_way_table, caption = "Crime Category counts by Season in Colchester (2024)")
Crime Category counts by Season in Colchester (2024)
category               Autumn  Spring  Summer  Winter  Total
Seasons_Totals           1565    1541    1631    1567   6304
violent-crime             595     565     618     642   2420
anti-social-behaviour     170     216     174     150    710
shoplifting               185     149     137     158    629
criminal-damage-arson      96     143     134     106    479
public-order              112      99     144     103    458
other-theft               100     110     102     100    412
vehicle-crime              57      47     108      58    270
drugs                      65      66      48      86    265
burglary                   50      34      43      44    171
bicycle-theft              60      25      30      34    149
other-crime                22      34      22      22    100
theft-from-the-person      19      19      27      26     91
robbery                    24      16      26      19     85
possession-of-weapons      10      18      18      19     65
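
Because the seasons differ slightly in length, the raw totals can be put on a per-day footing before they are compared. The sketch below does this for the season totals in the table above, using calendar month lengths for 2024 (a leap year); the object names are illustrative assumptions.

```r
# Season totals copied from the table above
season_totals <- c(Autumn = 1565, Spring = 1541, Summer = 1631, Winter = 1567)

# Days per month in 2024 (February has 29 days in a leap year)
days_in_month <- c(31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# Days per season under the month-to-season mapping used in this report
season_days <- c(
  Autumn = sum(days_in_month[9:11]),
  Spring = sum(days_in_month[3:5]),
  Summer = sum(days_in_month[6:8]),
  Winter = sum(days_in_month[c(12, 1, 2)])
)

# Recorded crimes per day in each season
crimes_per_day <- round(season_totals / season_days, 2)
print(crimes_per_day)

# Season with the lowest daily rate
lowest <- names(which.min(crimes_per_day))
print(lowest)
```

On these figures Spring keeps the lowest daily rate, although its gap to Autumn and Winter is under one recorded crime per day.
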
#TWO WAY TABLE: crime type counts per month 
tw_table <- crime%>%
  count(category, month_num, name= "m")%>%
  
#pivot to wider format for months to be columns & crime counts per category in rows;
#values_fill = 0 records category-month pairs with no crimes as 0 instead of NA
  pivot_wider(names_from = month_num, values_from = m, values_fill = 0)%>%
  
#add row total for category 
  mutate(
    Total = rowSums(across(where(is.numeric))))%>% 
  arrange(desc(Total)) #sort by most frequent crime category

knitr::kable(tw_table, caption = "Crime Category counts by month in Colchester (2024)")
Crime Category counts by month in Colchester (2024)
category                 1    2    3    4    5    6    7    8    9   10   11   12  Total
violent-crime          213  220  188  163  214  192  242  184  223  195  177  209   2420
anti-social-behaviour   42   64   66   70   80   63   53   58   58   56   56   44    710
shoplifting             48   49   50   40   59   42   58   37   47   64   74   61    629
criminal-damage-arson   33   46   37   43   63   44   51   39   33   33   30   27    479
public-order            43   36   34   33   32   42   49   53   39   37   36   24    458
other-theft             34   30   35   34   41   34   33   35   32   38   30   36    412
vehicle-crime           16   29   20   14   13   15   41   52   17   27   13   13    270
drugs                   28   24   29   25   12   12   17   19   25   21   19   34    265
burglary                19   15   11   10   13    9   18   16    8   17   25   10    171
bicycle-theft           11    8    7   12    6    9   12    9   12   19   29   15    149
other-crime             11    5   12   10   12    6    7    9    6   12    4    6    100
theft-from-the-person   11    6    5    6    8    7   12    8    4    7    8    9     91
robbery                 11    8    3    6    7    9   10    7   10    8    6    0     85
possession-of-weapons    9    6    5    5    8    6    5    7    5    3    2    4     65
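
A category with no recorded crimes in some month produces no row in count(), so the wide table needs an explicit zero for that cell. As a minimal base R sketch on made-up data, table() cross-tabulates two columns and fills unseen combinations with 0 automatically; the toy data frame and object names are assumptions for illustration.

```r
# Made-up records: no robbery is recorded in month 12
toy <- data.frame(
  category  = c("robbery", "robbery", "drugs", "drugs", "drugs"),
  month_num = c(11, 11, 11, 12, 12)
)

# Cross-tabulate: every category x month cell exists, unseen pairs are 0
counts <- table(toy$category, toy$month_num)
print(counts)

# Row totals are therefore never NA
print(rowSums(counts))

# By contrast, an NA cell propagates through rowSums() unless na.rm = TRUE
with_na <- matrix(c(2, NA, 1, 2), nrow = 2, byrow = TRUE)
print(rowSums(with_na))                 # first row total is NA
print(rowSums(with_na, na.rm = TRUE))   # NA ignored in the sum
```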

Observations:

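First table: Crime category counts in Colchester (2024). Violent crime was by far the most frequent category (2420 of 6304 recorded crimes), while possession of weapons was the least frequent (65).

Second table: Weather Averages by season (2024). Summer was the warmest and sunniest season (mean daily average of 16.33 °C and 6.95 hours of sunshine per day) and Winter the coldest and dullest (6.12 °C and 1.86 hours). Spring had the most rainfall (194.2 mm) and Summer the least (122.0 mm).

Third table: Crime Category counts by Season in Colchester (2024). Spring recorded the fewest crimes (1541) and Summer the most (1631), with Autumn (1565) and Winter (1567) in between.

Fourth table: Crime Category counts by month in Colchester (2024). Violent crime peaked in July (242), vehicle crime in August (52), and bicycle theft in November (29).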

Visualisations

Bar plot

#crime counts per category bar plot 
gg_crime_bar <- crime %>%
  count(category) %>%
  ggplot(aes(x = reorder(category, n), y = n, fill = category)) +
  geom_col(alpha = 0.9, show.legend = FALSE) + # Slightly transparent bars, no legend
  coord_flip() +  # Horizontal bars
  labs(title = "Crime Frequency by category in Colchester 2024", x = "Crime Category", y = "Frequency")+
  theme_minimal()

#Interactive plot 
ploty_crime_bar<- ggplotly(gg_crime_bar)%>% layout(showlegend = FALSE)
ploty_crime_bar

Observation for bar plot: Crime Frequency by category in Colchester 2024

  • Violent crime occurred most frequently.

  • Possession of weapons occurred least frequently.

Histogram and Density plots

#Temperature distribution
avg_temp_hist <- ggplot(temp, aes(x = TemperatureCAvg)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(title = "Average Temperature Distribution in Colchester (2024)", x = "Average Temperature (°C)")+
  theme_minimal()

ggplotly(avg_temp_hist)
#Max Temperature distribution
max_temp_hist <- ggplot(temp, aes(x = TemperatureCMax)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(title = "Maximum Temperature Distribution in Colchester (2024)", x = "Maximum Temperature (°C)")+
  theme_minimal()
ggplotly(max_temp_hist)
#Min Temperature distribution
min_temp_hist <- ggplot(temp, aes(x = TemperatureCMin)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(title = "Minimum Temperature Distribution in Colchester (2024)", x = "Minimum Temperature (°C)")+
  theme_minimal()
ggplotly(min_temp_hist)
#TdAvgC, (average dew point Temperature) distribution
TdAvgC_hist <- ggplot(temp, aes(x = TdAvgC)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(title = "Average Dew point Temperature Distribution in Colchester (2024)", x = "Average Dew point Temperature (°C)")+
  theme_minimal()
ggplotly(TdAvgC_hist)
#HrAvg - average relative humidity. Values given in %
HrAvg_hist <- ggplot(temp, aes(x = HrAvg)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(title = "Average Relative Humidity Distribution in Colchester (2024)", x = "Average Relative Humidity (%)")+
  theme_minimal()
ggplotly(HrAvg_hist)
#VisKm - visibility in kilometres
Viskm_hist <- ggplot(temp, aes(x = VisKm)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(title = "Visibility Distribution in Colchester (2024)", x = "Visibility (km)")+
  theme_minimal()
ggplotly(Viskm_hist)
#WindkmhInt - wind speed in km/h
WindkmhInt_hist <- ggplot(temp, aes(x = WindkmhInt)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(title = "Wind Speed Distribution in Colchester (2024)", x = "Wind Speed (km/h)")+
  theme_minimal()
ggplotly(WindkmhInt_hist)
#WindkmhGust - wind gust in km/h
WindkmhGust_hist <- ggplot(temp, aes(x = WindkmhGust)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(title = "Wind Gust Distribution in Colchester (2024)", x = "Wind Gust (km/h)")+
  theme_minimal()
ggplotly(WindkmhGust_hist)
#PresslevHp - Sea level pressure in hPa
PresslevHp_hist <- ggplot(temp, aes(x = PresslevHp)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(title = "Sea Level Pressure Distribution in Colchester (2024)", x = "Sea Level Pressure (hPa)")+
  theme_minimal()
ggplotly(PresslevHp_hist)

Observation of histograms:

  • Most weather variables followed approximately normal distributions.

  • Humidity and pressure distributions were skewed to the right.
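The visual impression of symmetry or skew can be cross-checked numerically. The following is a minimal sketch, not part of the original analysis: `skewness()` is a hypothetical helper defined here in base R, and `skew_demo` holds made-up values purely to show the sign convention (positive for a right tail, negative for a left tail).

```r
# Hypothetical helper: sample skewness as the third standardised moment
# (no small-sample bias correction).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}

# Made-up values, used only to illustrate the sign of the statistic.
skew_demo <- data.frame(
  symmetric    = c(1, 2, 3, 4, 5),
  right_skewed = c(1, 1, 2, 2, 10),
  left_skewed  = c(1, 9, 9, 10, 10)
)

skew_by_var <- sapply(skew_demo, skewness)
round(skew_by_var, 2)

# On the report's data the same call would be, for example:
# sapply(temp[c("HrAvg", "PresslevHp")], skewness)
```

A value near zero is consistent with an approximately symmetric distribution; the sign shows the direction of any skew seen in the histograms.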

#Precmm - precipitation totals in mm
rainfall_density <- ggplot(temp, aes(x = Precmm)) +
  geom_density(fill = "blue", alpha=0.5) +
  labs(title = "Total Rainfall Distribution in Colchester (2024)", x = "Precipitation (mm)")+
  theme_minimal()
ggplotly(rainfall_density)
#TotClOct - total cloudiness in octants
cloudiness_density <- ggplot(temp, aes(x = TotClOct)) +
  geom_density(fill = "blue", alpha=0.5) +
  labs(title = "Total Cloudiness Distribution in Colchester (2024)", x = "Cloudiness (octants)")+
  theme_minimal()
ggplotly(cloudiness_density)
#lowClOct - cloudiness by low level clouds in octants
low.cloudiness_density <- ggplot(temp, aes(x = lowClOct)) +
  geom_density(fill = "blue", alpha=0.5) +
  labs(title = "Low-Level Cloudiness Distribution in Colchester (2024)", x = "Cloudiness by low level clouds (octants)")+
  theme_minimal()
ggplotly(low.cloudiness_density)
#SunD1h - sunshine duration in hours
sun_density <- ggplot(temp, aes(x = SunD1h)) +
  geom_density(fill = "blue", alpha=0.5) +
  labs(title = "Sunshine Duration Distribution in Colchester (2024)", x = "Sunshine (hours)")+
  theme_minimal()
ggplotly(sun_density)

Observation for density plots:

  • Rainfall density peaked near 0 mm, indicating mostly dry days.

  • Cloudiness density increased with higher coverage.

  • Sunshine density peaked near zero, reflecting many overcast days.

Violin plot

#merge data sets:

# 1. Summarize crime data per month
crime_monthly <- crime %>%
  group_by(year_month, crime_season, category) %>%
  summarise(
    crime_count = n(),.groups = "drop")

# 2. Summarize temperature data per month
temp_monthly <- temp %>%
  group_by(year_month, temp_season) %>%
  summarise(
    avg_temp = mean(TemperatureCAvg),
    max_temp = mean(TemperatureCMax),
    min_temp = mean(TemperatureCMin),
    rainfall = mean(Precmm),
    sunshine = mean(SunD1h),
    .groups = "drop") 

# 3. Join both summaries
colchester_monthly <- left_join(crime_monthly, temp_monthly, by = "year_month")

# Check
glimpse(colchester_monthly)
## Rows: 167
## Columns: 10
## $ year_month   <chr> "2024-01", "2024-01", "2024-01", "2024-01", "2024-01", "2…
## $ crime_season <chr> "Winter", "Winter", "Winter", "Winter", "Winter", "Winter…
## $ category     <chr> "anti-social-behaviour", "bicycle-theft", "burglary", "cr…
## $ crime_count  <int> 42, 11, 19, 33, 28, 11, 34, 9, 43, 11, 48, 11, 16, 213, 6…
## $ temp_season  <chr> "Winter", "Winter", "Winter", "Winter", "Winter", "Winter…
## $ avg_temp     <dbl> 4.251613, 4.251613, 4.251613, 4.251613, 4.251613, 4.25161…
## $ max_temp     <dbl> 7.348387, 7.348387, 7.348387, 7.348387, 7.348387, 7.34838…
## $ min_temp     <dbl> 0.7419355, 0.7419355, 0.7419355, 0.7419355, 0.7419355, 0.…
## $ rainfall     <dbl> 1.748387, 1.748387, 1.748387, 1.748387, 1.748387, 1.74838…
## $ sunshine     <dbl> 2.832258, 2.832258, 2.832258, 2.832258, 2.832258, 2.83225…
#colchester_monthly <- crime %>%
  #left_join(temp, by = "year_month")  # Ensure datasets are merged

# Violin plot: distribution of per-category crime counts by month
crime_sun_violin <- ggplot(colchester_monthly,
                          aes(x = factor(format(ym(year_month), "%b"), levels = month.abb),
                              y = crime_count,
                              colour = factor(format(ym(year_month), "%b"), levels = month.abb))) +
  geom_violin(trim = FALSE) +
  labs(title = "Crime Distribution by Month in Colchester 2024",
       x = "Month",
       y = "Number of Crimes") +
  stat_summary(fun = median, geom = "point") +  # `fun` replaces the deprecated `fun.y`
  theme_minimal()
crime_sun_violin <- crime_sun_violin + guides(colour = "none")  # hide the redundant colour legend
ggplotly(crime_sun_violin) %>%
  layout(hoverlabel = list(bgcolor = "white"),
         xaxis = list(title = "Month"),
         yaxis = list(title = "crime count"))

Observation for Violin, Crime Distribution by Month in Colchester 2024:

  • Crime was fairly evenly distributed throughout the year.

  • Autumn and winter months (September, October, November, December and January) showed greater spread, suggesting more variability in counts across crime categories.
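The spread seen in the violins can also be summarised as numbers. This is a minimal sketch under stated assumptions: `demo_monthly` is a hypothetical miniature with the same column names as `colchester_monthly`, and its counts are illustrative rather than taken from crime24.csv.

```r
library(dplyr)

# Hypothetical miniature of `colchester_monthly`: one row per month and category.
# Counts are illustrative only.
demo_monthly <- data.frame(
  year_month  = rep(c("2024-01", "2024-04"), each = 4),
  category    = rep(c("anti-social-behaviour", "burglary",
                      "shoplifting", "violent-crime"), times = 2),
  crime_count = c(42, 19, 43, 213, 40, 20, 45, 162)
)

# Median, interquartile range and standard deviation of category counts per month.
spread_by_month <- demo_monthly %>%
  group_by(year_month) %>%
  summarise(
    median_count = median(crime_count),
    iqr_count    = IQR(crime_count),
    sd_count     = sd(crime_count),
    .groups = "drop"
  )

spread_by_month
```

Applied to `colchester_monthly`, a larger `iqr_count` or `sd_count` for a month would correspond to a wider violin in the plot.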

Scatter and Correlation plots

# Scatter plot: sunshine vs rainfall
sun.rain_scatter <- ggplot(temp, aes(x = SunD1h, y = Precmm)) +
  geom_point(alpha = 0.5, color = "red") +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(title = "Sunshine vs. Rainfall", y = "Rainfall (mm)", x = "Sunshine (hours)")
ggplotly(sun.rain_scatter)
# Scatter plot: max vs min temperature 
max.min_scatter<-ggplot(temp, aes(x = TemperatureCMin, y = TemperatureCMax)) +
  geom_point(alpha = 0.5, color= "red") +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(title = "Scatter Plot of Daily Min vs Max Temperatures", 
       x = "Min Temperature (°C)", 
       y = "Max Temperature (°C)")
ggplotly(max.min_scatter)
# Scatter Plot of sunshine vs Max Temperatures
sun.max_scatter<- ggplot(temp, aes(x = SunD1h, y = TemperatureCMax)) +
  geom_point(alpha = 0.5, color= "red") +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(title = "Scatter Plot of Sunshine vs Max Temperatures", 
       x = "Sunshine (hours)", 
       y = "Max Temperature (°C)")
ggplotly(sun.max_scatter)

Observation for scatter plots: Sunshine vs. Rainfall:

  • Weak negative association: days with lower precipitation tended to have more sunshine hours (for example, 0.0 mm of precipitation alongside 10.2 sunshine hours). See the steel-blue trend line in the plot.

Scatter Plot of Daily Min vs Max Temperatures:

  • Strong positive association: higher minimum temperatures tended to coincide with higher maximum temperatures. See the steel-blue trend line in the plot.

Scatter Plot of Sunshine vs Max Temperatures:

  • Moderate positive association: more sunshine hours tended to coincide with higher maximum temperatures. See the steel-blue trend line in the plot.
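The labels "weak", "moderate" and "strong" above are read from the plots. One way to attach a number to each pair is a Pearson correlation test. The sketch below uses base R only; `pair_cor()` is a hypothetical helper, and the three short vectors are made-up values that stand in for `temp$SunD1h`, `temp$Precmm` and `temp$TemperatureCMax`.

```r
# Hypothetical helper: Pearson r and its p-value for one pair of variables.
pair_cor <- function(x, y) {
  ct <- cor.test(x, y, method = "pearson")
  c(r = unname(ct$estimate), p_value = ct$p.value)
}

# Made-up daily values, for illustration only.
sun_demo  <- c(0.0, 1.5, 3.0, 5.5, 8.0, 10.2)   # sunshine (hours)
rain_demo <- c(6.0, 4.2, 0.0, 1.0, 0.5, 0.0)    # rainfall (mm)
tmax_demo <- c(6, 8, 12, 15, 19, 24)            # max temperature (°C)

sun_rain <- pair_cor(sun_demo, rain_demo)
sun_tmax <- pair_cor(sun_demo, tmax_demo)

round(rbind(sun_rain, sun_tmax), 3)

# On the report's data the calls would be, for example:
# pair_cor(temp$SunD1h, temp$Precmm)
# pair_cor(temp$TemperatureCMin, temp$TemperatureCMax)
```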
#correlation analysis between temp variables and crime counts
#library(ggcorrplot)
numeric_cols <- colchester_monthly %>% select(where(is.numeric))

pval.cor<- cor_pmat(numeric_cols)

corrmat<-round(cor(numeric_cols),1) 
  
colch.monthly_cor<- ggcorrplot(corrmat, hc.order = TRUE, type="lower", 
                               p.mat=pval.cor, sig.level= 0.01, insig= "pch", 
                               pch=4, pch.col = "black")

ggplotly(colch.monthly_cor)

Observation for correlation analysis between temp variables and crime counts:

  • red, positive correlation

  • white, no correlation

  • purple, negative correlation

  • cells marked with “X” indicate correlations that are not statistically significant at the chosen significance level (0.01, as set by sig.level in the code)

  • crime_count shows no strong linear correlation with the temp variables

  • the temp variables (avg_temp, min_temp, max_temp) are strongly positively correlated with one another. This is expected, as they are different measures of the same underlying factor (temperature) and naturally move in tandem.
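Note that `colchester_monthly` holds one row per month and category (167 rows), so each month's weather values repeat across categories and `crime_count` mixes categories of very different sizes. A possible cross-check, sketched here under assumptions, is to collapse the data to one row per month before correlating. `demo_monthly` is a hypothetical miniature with illustrative numbers, not the report's data.

```r
library(dplyr)

# Hypothetical miniature of `colchester_monthly`; all values are illustrative.
demo_monthly <- data.frame(
  year_month  = rep(c("2024-01", "2024-04", "2024-07", "2024-10"), each = 2),
  category    = rep(c("burglary", "violent-crime"), times = 4),
  crime_count = c(19, 213, 22, 162, 25, 242, 20, 190),
  avg_temp    = rep(c(4.3, 9.8, 17.5, 11.9), each = 2),
  sunshine    = rep(c(2.8, 5.6, 7.1, 3.4), each = 2)
)

# One row per month: total crimes plus that month's (repeated) weather values.
monthly_totals <- demo_monthly %>%
  group_by(year_month) %>%
  summarise(
    total_crimes = sum(crime_count),
    avg_temp     = first(avg_temp),
    sunshine     = first(sunshine),
    .groups = "drop"
  )

# Correlation matrix on the month-level table.
cor_totals <- cor(monthly_totals[, c("total_crimes", "avg_temp", "sunshine")])
round(cor_totals, 2)
```

With a single year of data this leaves only twelve observations, so any month-level correlation should be read cautiously.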

Time Series

#count crimes by date
crime_over_time <- crime %>%
  count(date_parsed, name = "crime_count")  # Count crimes per day

# Create the time series plot
crime_over_time_plot <- ggplot(crime_over_time, aes(x = date_parsed, 
                                                    y = crime_count)) +
  geom_line(color = "steelblue") +
  geom_point(color = "red") +
  labs(
    title = "Daily Crime Incidents in Colchester (2024)",
    x = "Date", 
    y = "Number of Crimes") +
  theme_minimal() +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#interactive
ggplotly(crime_over_time_plot)
#time series plot to observe the highest crime category over the year
# Create violent crimes data frame
v_df <- crime %>%
  filter(category == "violent-crime") %>%  # Filter for violent crimes
  group_by(year_month) %>% 
  summarise(total_v = n(), .groups = 'drop')  # Count occurrences

# Prepare variables for time series plotting
v.months <- ym(v_df$year_month)  # Convert to date object
v_crimes <- as.numeric(v_df$total_v)  # Total violent crimes by year_month to numeric for plot

# Create a data frame for plotting
v_plot_data <- data.frame(months = v.months, crimes = v_crimes)

# Plot using ggplot
v_ts.plot <- ggplot(v_plot_data, aes(x = months, y = crimes)) +
  geom_point(color = "red") +
  geom_line(color = "steelblue") +
  labs(
    title = "Violent Crimes in Colchester in 2024 by Month",
    x = "Month",
    y = "Number of Violent Crimes") +
  theme_minimal()+
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") + #add individual coordinate xlabs 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
  

# Create interactive plot
v_ts.plotly <- ggplotly(v_ts.plot)
v_ts.plotly

Observation for time series of Violent Crimes in Colchester in 2024 by Month:

  • July had the highest violent crime count (242 incidents).

  • April had the lowest (162 incidents).

  • Violent crime counts were higher during the summer months, while spring and autumn showed lower counts.

However, the pattern is not a steady trend. Several months (for example, May and October) showed declines in violent crime, indicating the presence of other influencing factors, potentially local events, law enforcement interventions, or socioeconomic conditions.

Overall, the data supports the broader narrative that crime rates, particularly violent crime, tend to increase during the summer months, reinforcing the rationale behind seasonal crime analysis in this report.
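The seasonal comparison can be made explicit by averaging the monthly violent crime counts within each season. The sketch below is illustrative only: `demo_violent` is a hypothetical table in which the April (162) and July (242) counts come from the observations above, while the March and August counts and the season labels are placeholders.

```r
library(dplyr)

# Hypothetical monthly violent crime counts. April and July match the report;
# March and August are placeholders, not real figures.
demo_violent <- data.frame(
  year_month   = c("2024-03", "2024-04", "2024-07", "2024-08"),
  crime_season = c("Spring", "Spring", "Summer", "Summer"),
  total_v      = c(180, 162, 242, 215)
)

# Average monthly violent crime count per season.
seasonal_violent <- demo_violent %>%
  group_by(crime_season) %>%
  summarise(
    months       = n(),
    mean_monthly = mean(total_v),
    .groups = "drop"
  )

seasonal_violent

# On the report's data, the monthly counts could be built with, for example:
# crime %>% filter(category == "violent-crime") %>%
#   count(crime_season, year_month, name = "total_v")
```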

Map

  • Interactive Leaflet map of crime counts and locations
#visualising crime on colchester map 2024

crime_map<- crime%>%
  group_by(lat, long, street_name)%>%
  summarise(crimes=n(), .groups = 'drop')%>%  # Count occurrences
  
  #plot map
  leaflet()%>%
  addTiles()%>%
  setView(0.901028, 51.89301, zoom= 12)%>%
  addCircleMarkers(radius=~crimes*0.13, color= "red", 
                   popup = ~paste(street_name, "<br>Crimes: ", crimes))

crime_map

Observation for Interactive Leaflet map of crime counts and locations:

  • Hotspots for crime included the town centre, mainly on or near shopping areas (230 counts), the police station (166), and on or near George Street (140).

These areas likely experienced more crime due to higher pedestrian traffic, which increases the opportunity for offences such as theft and robbery.

A potential limitation of this analysis was that the specific types of crime occurring in each location were not explored. Future work could involve spatially filtering crime categories to identify whether certain areas are more prone to specific offences, using tools such as clustered point analysis or location-based filtering in Leaflet.
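As a rough illustration of the suggested future work, the sketch below adds one marker layer per crime category, clusters nearby points, and adds a layer control so that categories can be toggled. It is a sketch under assumptions rather than part of the analysis: `crime_demo` is a hypothetical sample whose coordinates, categories and street names are invented for the example.

```r
library(leaflet)

# Hypothetical sample; coordinates, categories and street names are illustrative.
crime_demo <- data.frame(
  lat         = c(51.8893, 51.8897, 51.8925, 51.8950),
  long        = c(0.9010, 0.9021, 0.8990, 0.9050),
  category    = c("shoplifting", "shoplifting", "violent-crime", "violent-crime"),
  street_name = c("Street A", "Street A", "Street B", "Street C")
)

category_map <- leaflet() %>%
  addTiles() %>%
  setView(0.901028, 51.89301, zoom = 13)

# One clustered marker layer per category, so each can be shown or hidden.
for (ctg in unique(crime_demo$category)) {
  category_map <- category_map %>%
    addCircleMarkers(
      data = crime_demo[crime_demo$category == ctg, ],
      lng = ~long, lat = ~lat,
      group = ctg,
      color = "red",
      popup = ~paste(street_name, "<br>Category: ", category),
      clusterOptions = markerClusterOptions()
    )
}

# Checkbox control listing the category layers.
category_map <- category_map %>%
  addLayersControl(
    overlayGroups = unique(crime_demo$category),
    options = layersControlOptions(collapsed = FALSE)
  )

category_map
```

Unticking a category in the control hides its markers, which makes it easier to see whether a hotspot is dominated by one type of offence.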

Key Findings

The findings revealed a seasonal pattern in crime rates, with summer emerging as the least safe period in 2024. July had the highest number of violent crime incidents, likely driven by increased social activity and warmer weather. Spring and winter recorded comparatively lower crime levels.

Although strong relationships were identified among the weather variables themselves (for example, between the temperature measures), no statistically significant linear relationships between weather variables and crime counts were observed. This suggests that any connection between these weather factors and crime is indirect or non-linear.

Spatial analysis highlighted concentrated crime activity in Colchester’s central areas, indicative of the influence of geographic and demographic factors (for example: high streets, main roads, shopping malls and entertainment venues). These locations tend to attract larger crowds and provide greater anonymity for offenders, making the public more vulnerable to crimes such as theft, assault, or anti-social behaviour.

Conclusion

This analysis confirmed that crime in Colchester displayed seasonal variation, with the summer months representing a period of higher crime risk. However, the absence of significant linear correlations between the weather and crime variables suggests that other social and environmental factors may also play a role.

Based on the findings, it is recommended that crime prevention efforts focus more intensively on the summer months and on urban hotspots. Further research should consider incorporating socio-economic data and employing non-linear models to capture the underlying drivers of crime more effectively. Additionally, future studies could investigate the types of crimes committed at specific locations to better tailor interventions.

References

https://www.datacamp.com/cheat-sheet/xts-cheat-sheet-time-series-in-r

https://plotly.com/r/reference/#Layout_and_layout_style_objects